Basic statistics

Jürgen Wilbert

University of Potsdam

9/2/22

Goals

  • You know some functions for basic statistical analyses.
  • You can create contingency tables.
  • You learn to integrate filtering, selecting, grouping, and subsetting of data frames with functions for statistical analyses.

Descriptive statistics

Some basic statistical functions

  • min(), max(): Minimum and maximum
  • mean(), median(): Mean and median
  • sd(), var(): Standard deviation and variance
  • mad(): Median absolute deviation
  • quantile(): Percentile / quantile

Task

  • Take the mtcars dataset
  • Create a new variable lpk (liters per 100km) from mpg
    (Formula: lpk = 282.5 / mpg)
  • Calculate grouped by cylinders (cyl): mean, sd, median, mad, min, max of lpk
  • Round all values by one decimal
  • Hint: Use the tidyverse or dplyr library

:-)

mtcars %>% 
  mutate(lpk = 282.5 / mpg) %>% 
  group_by(cyl) %>% 
  summarise(
    mean = mean(lpk),
    sd = sd(lpk),
    median = median(lpk),
    mad = mad(lpk),
    min = min(lpk),
    max = max(lpk)
  ) %>%
  round(1) 
cyl mean sd median mad min max
4 10.9 1.8 10.9 2.3 8.3 13.2
6 14.4 1.1 14.3 1.3 13.2 15.9
8 19.3 3.8 18.6 1.9 14.7 27.2

What is happening here?

mtcars <- mutate(mtcars, 
  transmission = factor(am, labels = c("Manual", "Automatic"))
)

Now, lets calculate the stats for the transmission type:

mtcars %>% 
  mutate(lpk = 282.5 / mpg) %>% 
  group_by(transmission) %>% 
  summarise(
    mean = mean(lpk),
    sd = sd(lpk),
    median = median(lpk),
    mad = mad(lpk),
    min = min(lpk),
    max = max(lpk)
  ) 
transmission mean sd median mad min max
Manual 17.35861 4.339156 16.32948 3.344815 11.577869 27.16346
Automatic 12.33851 3.341021 12.39035 3.028011 8.333333 18.83333

… and round the results

mtcars %>% 
  mutate(lpk = 282.5 / mpg) %>% 
  group_by(transmission) %>% 
  summarise(
    mean = mean(lpk),
    sd = sd(lpk),
    median = median(lpk),
    mad = mad(lpk),
    min = min(lpk),
    max = max(lpk)
  ) %>%
  round(1)
Error in Math.data.frame(list(transmission = 1:2, mean = c(17.3586131968484, : 
non-numeric-alike variable(s) in data frame: transmission

ups :-( … what went wrong?

Solution: only round variables that are numeric:

  • is.numeric() returns a TRUE or FALSE for a vector (is.numeric(1:5); is.numeric(c("A", "B")))
  • across() specifies for mutate()which variable to select.

mutate(across(.cols, .fns, ... )):

  • .cols := columns so select
  • .fns := a function to apply
  • … := arguments to that function

mtcars %>% 
  mutate(lpk = 282.5 / mpg) %>% 
  group_by(transmission) %>% 
  summarise(
    mean = mean(lpk),
    sd = sd(lpk),
    median = median(lpk),
    mad = mad(lpk),
    min = min(lpk),
    max = max(lpk)
  ) %>%
  mutate(across(where(is.numeric), round, 1))
transmission mean sd median mad min max
Manual 17.4 4.3 16.3 3.3 11.6 27.2
Automatic 12.3 3.3 12.4 3.0 8.3 18.8

Tables

Tables in base R

  • table(): Shows frequencies of nominal scaled variables.
  • prop.table(): Calculates proportions from frequency tables.
  • addmargins(): Adds margins to tables.

Example one dimensional

tab <- table(mtcars$cyl) # frequencies
tab

 4  6  8 
11  7 14 
prop.table(tab) # proportions

      4       6       8 
0.34375 0.21875 0.43750 
prop.table(tab) * 100 # percentages 

     4      6      8 
34.375 21.875 43.750 

Example two dimensional

tab <- table(mtcars$cyl, mtcars$am)
tab
   
     0  1
  4  3  8
  6  4  3
  8 12  2
prop.table(tab) * 100
   
         0      1
  4  9.375 25.000
  6 12.500  9.375
  8 37.500  6.250

Example two dimensional with percentages by rows and columns

tab <- table(mtcars$cyl, mtcars$am)
prop.table(tab, margin = 1) * 100  # sum of each row is 100%
   
           0        1
  4 27.27273 72.72727
  6 57.14286 42.85714
  8 85.71429 14.28571
prop.table(tab, margin = 2) * 100 # sum of each column is 100%
   
           0        1
  4 15.78947 61.53846
  6 21.05263 23.07692
  8 63.15789 15.38462

Example with added margins

tab <- table(mtcars$cyl, mtcars$am)
tab <- prop.table(tab, margin = 2) * 100
addmargins(tab)
     
              0         1       Sum
  4    15.78947  61.53846  77.32794
  6    21.05263  23.07692  44.12955
  8    63.15789  15.38462  78.54251
  Sum 100.00000 100.00000 200.00000

Task

  • Take the heights dataset from the dslabs library.
  • Create a cross table depicting: The percentage of females and males that are above and below a height of 170 cm.

library(dslabs)

heights$category[heights$height * 2.54 <= 170] <- "% <= 170 cm"
heights$category[heights$height * 2.54 > 170] <- "% > 170 cm"

tab <- table(heights$sex, heights$category)
tab <- prop.table(tab, margin = 1) * 100
tab <- round(tab, 1)
tab
        
         % <= 170 cm % > 170 cm
  Female        71.0       29.0
  Male          19.8       80.2

Tables with dplyr

mtcars %>% 
  group_by(cyl, am) %>% 
  summarise(n = n())
cyl am n
4 0 3
4 1 8
6 0 4
6 1 3
8 0 12
8 1 2

Shortcut for simple count tables

# Because it is shorter, I don't use pipes here (but I could!)
count(mtcars, cyl) 
cyl n
4 11
6 7
8 14
count(mtcars, cyl, am) 
cyl am n
4 0 3
4 1 8
6 0 4
6 1 3
8 0 12
8 1 2

Tables with dplyr: pivot_wider()

pivot_wider():
creates separate variables from the levels of a factor variable and the values of a second variable

mtcars %>% count(cyl, am)
cyl am n
4 0 3
4 1 8
6 0 4
6 1 3
8 0 12
8 1 2
mtcars %>% count(cyl, am) %>%
  pivot_wider(names_from = "am", values_from = "n")
cyl 0 1
4 3 8
6 4 3
8 12 2

Tables with dplyr: A bit nicer!

mtcars %>% 
  mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  count(cyl, am) %>% 
  pivot_wider(names_from = "am", values_from = "n") %>%
  rename("Cylinders" = "cyl")
Cylinders Manual Automatic
4 3 8
6 4 3
8 12 2

Task

Recreate the previous task. This time using dplyr:

  • Take the heights dataset from the dslabs library.
  • Create a cross table depicting: The percentage of females and males that are above and below a height of 170 cm.

This is a hard task …

Hint 1: ifelse(height * 2.54 <= 170, "size_a", "size_b")
Hint 2:
size_a / (size_a + size_b) * 100
size_b / (size_a + size_b) * 100
are the row proportions.

library(dslabs)

heights %>%
  mutate(category = ifelse(height * 2.54 <= 170, "size_a", "size_b")) %>% 
  count(sex, category) %>%
  pivot_wider(names_from = "category", values_from = "n") %>%
  mutate(
    "% <= 170" = size_a / (size_a + size_b) * 100,
    "% > 170" = size_b / (size_a + size_b) * 100,
    across(where(is.numeric), round, 1)
  ) %>%
  select(1, 4, 5)
sex % <= 170 % > 170
Female 71.0 29.0
Male 19.8 80.2

Tables with dplyr: For values of a third variable

mtcars %>% 
  mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  group_by(cyl, am) %>% 
  summarise(
    n = n(), 
    M = mean(mpg), 
    SD = sd(mpg)
  ) 
cyl am n M SD
4 Manual 3 22.90000 1.4525839
4 Automatic 8 28.07500 4.4838599
6 Manual 4 19.12500 1.6317169
6 Automatic 3 20.56667 0.7505553
8 Manual 12 15.05000 2.7743959
8 Automatic 2 15.40000 0.5656854

pivot_wider() can take values from several variables:

mtcars %>% 
  mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  group_by(cyl, am) %>% 
  summarise(n = n(), M = mean(mpg), SD = sd(mpg)) %>%
  pivot_wider(names_from = "am", values_from = c("n", "M", "SD")) %>%
  round(1)
cyl n_Manual n_Automatic M_Manual M_Automatic SD_Manual SD_Automatic
4 3 8 22.9 28.1 1.5 4.5
6 4 3 19.1 20.6 1.6 0.8
8 12 2 15.1 15.4 2.8 0.6

Set argument names_vary = "slowest" for a different ordering of variables:

mtcars %>% 
  mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  group_by(cyl, am) %>% 
  summarise(n = n(), M = mean(mpg), SD = sd(mpg)) %>%
  pivot_wider(names_from = "am", values_from = c("n", "M", "SD"), names_vary = "slowest") %>%
  round(1)
cyl n_Manual M_Manual SD_Manual n_Automatic M_Automatic SD_Automatic
4 3 22.9 1.5 8 28.1 4.5
6 4 19.1 1.6 3 20.6 0.8
8 12 15.1 2.8 2 15.4 0.6

Task

  • Take the storms dataset and calculate n and mean of the wind speed (wind) by month (month) and storm classification (status) in a crosstable.

storms %>% 
  group_by(month, status) %>%
  summarise(
    n = n(),
    M = mean(wind, na.rm = TRUE)
  ) %>%
  pivot_wider(names_from = "status", values_from = c("n", "M")) %>%
  round()
month n_hurricane n_tropical depression n_tropical storm M_hurricane M_tropical depression M_tropical storm
1 5 2 23 70 30 48
4 NA NA 13 NA NA 45
5 NA 40 50 NA 28 42
6 18 169 187 70 27 44
7 100 349 474 81 26 45
8 776 717 1214 85 27 45
9 1793 1085 2016 89 28 46
10 701 379 913 85 28 46
11 187 136 387 83 28 48
12 33 21 71 68 30 44

Design example:

Replicate and try to understand the following table example:

library(kableExtra)
tab <- mtcars %>% 
  group_by(cyl, am) %>% 
  summarise(n = n(), M = mean(mpg), SD = sd(mpg)) %>%
  pivot_wider(names_from = "am", values_from = c("n", "M", "SD"), names_vary = "slowest") 

names(tab) <- c("Cylinders", "n", "M", "SD", "n", "M", "SD")

kable(tab, 
  caption =  "Table 1.<br>Miles per gallon by cylinders and gearshift", 
  digits = 1, 
  full_width = FALSE
) %>% 
  kable_classic() %>%
  add_header_above(c(" " = 1, "Automatic" = 3, "Manual" = 3))

Table 1.
Miles per gallon by cylinders and gearshift
Automatic
Manual
Cylinders n M SD n M SD
4 3 22.9 1.5 8 28.1 4.5
6 4 19.1 1.6 3 20.6 0.8
8 12 15.1 2.8 2 15.4 0.6

Inferential statistics

Some statistical functions:

  • t.test(): Calculating a t-test.

  • wilcox.test(): Calculating Wilcox test / U-Test.

  • chisq.test(): Calculating a Pearson \(X^2\)-test.

  • Some more we will not address today:

    • lm(): Regression analyses
    • cor.test(): Calculating a correlation test.
    • binom.test(): Binomial test.
    • fisher.test(): Fisher exact test for count data.
    • ks.test(): One and two sample Kolmogorov-Smirnov Tests.
    • shapiro.test(): Shapiro-Wilk Normality Test.
    • aov(): Analysis of variance

chisq.test()

  • Pearson’s \(X^2\) test is a very versatile test to calculate whether a distribution of observed frequencies equates expected frequencies.
  • A very common application is to test whether frequencies of observations in two variables are related. (E.g. number of patients that improved in a treatment vs. control group).
  • The chisq.test() functions takes a two dimensional frequency table and calculates a \(X^2\) test.

Example with data from an intervention study

# get an external data example
dat <- read.csv("https://goo.gl/j6lRXD")
# Create a two distribution table
tab <- table(dat$treatment, dat$improvement)
tab
             
              improved not-improved
  not-treated       26           29
  treated           35           15
# Test for non random distribution:
chisq.test(tab) # Alternatively: chisq.test(dat$treatment, dat$improvement)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 4.6626, df = 1, p-value = 0.03083

Significantly more patients improved in the treatment condition (\(X^2(1)=4.7, p < .05\))

Task

  • Take the starwars dataset from the tidyverse package.
  • Use table() to get the distribution eye_color by hair_color
  • Apply the chisq.test() to calculate a X² test to test for an even distribution.

:-)

tab <- table(starwars$eye_color, starwars$hair_color)
chisq.test(tab)

    Pearson's Chi-squared test

data:  tab
X-squared = 210.67, df = 154, p-value = 0.001653

t.test()

  • Student’s t-test analysis whether two samples originate from the same normal distribution.

  • It is used to test for mean differences in two groups.

  • Arguments of the t.test() function:

    • x and y: Each variable provides data from a samples.
    • formula: If you have one vector with all data (e.g. values) and a second vector with grouping information (e.g. group) use: values ~ group.
    • data: If you work with formula: The name of a data frame.
    • paired: If set TRUE, the test assumes repeated measures of one sample instead of two independent samples.

Example

Does one of two drugs increase hours of sleep better?

extra: change in sleep duration, group : drug given

sleep
extra group ID
0.7 1 1
-1.6 1 2
-0.2 1 3
-1.2 1 4
-0.1 1 5
3.4 1 6
3.7 1 7
0.8 1 8
0.0 1 9
2.0 1 10
1.9 2 1
0.8 2 2
1.1 2 3
0.1 2 4
-0.1 2 5
4.4 2 6
5.5 2 7
1.6 2 8
4.6 2 9
3.4 2 10

Example

# Applying a T-test by providing x and y
x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]
t.test(x, y)

# Applying a T-test with a formula
t.test(extra ~ group, data = sleep)

Example


    Welch Two Sample t-test

data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 -3.3654832  0.2054832
sample estimates:
mean in group 1 mean in group 2 
           0.75            2.33 

Group 2 shows an increase in sleep length of \(\Delta M=1.58\) (\(t(17.8)=1.86, p = .08\))

wilcox.test()

  • Calculates a Wilcoxon rank sum test. Also known as Mann Whitney U-Test
  • This test is applied as an alternative to a t-Test when data are assumed to be non-normal distributed.
  • It takes the same arguments as t.test()

Task

  • Take the sleep dataset from the previous example.
  • Calculate the median and mad (Median absolute deviation) for extra for each group.
  • Calculate a Wilcoxon test with the sleep dataset on the effectiveness of the intervention.

:-)

library(dplyr)
sleep %>% group_by(group) %>%
  summarise(
    median = median(extra), 
    mad = mad(extra)
  )
group median mad
1 0.35 1.55673
2 1.75 2.44629
wilcox.test(extra~group, data = sleep)

    Wilcoxon rank sum test with continuity correction

data:  extra by group
W = 25.5, p-value = 0.06933
alternative hypothesis: true location shift is not equal to 0

Task

  • Install the “dslabs” package
  • Take the gapminder dataset. This dataset includes health and income outcomes for 184 countries from 1960 to 2016.
  • It includes infant_mortality, region and continent.
  • Calculate the mean and median infant_mortality for each continent
  • Calculate a t-test comparing infant_mortality between the regions Southern- and Eastern Europe.

:-)

install.packages("dslabs")
library(dslabs)
gapminder %>% group_by(continent) %>% 
  summarise(
    mean = mean(infant_mortality, na.rm = TRUE),
    median = median(infant_mortality, na.rm = TRUE)
  )
continent mean median
Africa 95.12395 93.40
Americas 42.88145 30.80
Asia 55.26174 43.10
Europe 15.33022 11.25
Oceania 39.10136 29.10

x <- gapminder %>% filter(region == "Southern Europe") %>% select(infant_mortality)
y <- gapminder %>% filter(region == "Eastern Europe") %>% select(infant_mortality)
t.test(x, y)

# or with pipes
gapminder %>% 
  filter(region %in% c("Southern Europe", "Eastern Europe")) %>%
  t.test(infant_mortality ~ region, data = .)

    Welch Two Sample t-test

data:  infant_mortality by region
t = 0.53658, df = 929.22, p-value = 0.5917
alternative hypothesis: true difference in means between group Eastern Europe and group Southern Europe is not equal to 0
95 percent confidence interval:
 -1.471847  2.579544
sample estimates:
 mean in group Eastern Europe mean in group Southern Europe 
                     20.36164                      19.80779